Abstract

Failure detection protocols, a fundamental building block for crafting
fault-tolerant distributed systems, are in many cases described by their
authors by means of informal pseudo-code of their own conception. Often this
pseudo-code uses syntactical constructs that are not available in COTS
programming languages such as C or C++. The resulting informal descriptions
call for ad hoc interpretations and implementations. Being informal, these
descriptions cannot be tested by their authors, which may lead to
insufficiently detailed or even faulty specifications. This paper tackles
this problem by introducing a formal syntax for those constructs and a C
library that implements them: a tool-set to express and reason about failure
detection protocols. The resulting specifications are longer but
unambiguous, and eligible for becoming a standard form.


1 Introduction 

Failure detection constitutes a fundamental building block for crafting fault-tolerant
distributed systems, and many researchers have devoted their efforts to this
direction during the last decade. Failure detection protocols are often described by their
authors by means of informal pseudo-code of their own conception. Often this
pseudo-code uses syntactical constructs such as repeat periodically,
at time t send heartbeat, at time t check whether message has
arrived, or upon receive, together with several variants (see Table 1). We
observe that such syntactical constructs are not often found in COTS programming
languages such as C or C++, which leads to the problem of translating the protocol
Construct               NFD-E [4]  φ    FD   GMFD [16]  ◊P [1]  HB [2]  HB-pt [2]
----------------------  ---------  ---  ---  ---------  ------  ------  ---------
Repeat periodically     no         no   yes  no         yes     yes     yes
Upon t = current time   yes        no   yes  yes        no      no      no
Upon receive message    yes        yes  yes  yes        yes     yes     yes
Concurrency management  yes        yes  no   no         yes     yes     yes


Table 1: Syntactical constructs used in several failure detector protocols. φ is
the accrual failure detector. ◊P is the eventually perfect failure detector of [1].
HB is the Heartbeat detector [2]. HB-pt is the partition-tolerant version of the
Heartbeat detector. By “Concurrency management” we mean coroutines, threading
or forking.


specifications into running software prototypes using one such standard language.
Furthermore, the lack of a formal, well-defined, and standard form to express failure
detection protocols often leads their authors to insufficiently detailed descriptions.
Those informal descriptions in turn complicate the reading process and exacerbate
the work of the implementers, which becomes time-consuming, error-prone and at
times frustrating.

Several researchers and practitioners are currently arguing that failure detection
should be made available as a network service. To the best of our knowledge
no such service exists to date. Lacking such a tool, it is important to devise methods
to express in the application layer of our software even the most complex failure
detection protocols in a simple and natural way.

In the following we introduce one such method: a class of “time-outs”, i.e.,
objects that postpone a certain function call by a given amount of time. This feature
converts time-based events into non time-based events such as message arrivals
and easily expresses the constructs in Table 1 in standard C. In some cases, our
class removes the need for concurrency management facilities such as coroutines
or thread management libraries. The formal character of our method allows
rapid prototyping of the algorithms with minimal effort. This is shown through a
Literate Programming [9] framework that produces from the same source file both
the description meant for dissemination and a software skeleton to be compiled in
standard C or C++.

The rest of this article is structured as follows: Section 2 introduces our tool.
In Sect. 3 we use it to express three classical failure detectors. Section 4 is a case
1. /* declarations */

TOM *tom;
timeout_t t1, t2, t3;

int my_alarm(TOM*), another_alarm(TOM*);

2. /* definitions */

tom = tom_init(my_alarm);

tom_declare(&t1, TOM_CYCLIC, TOM_SET_ENABLE, TIMEOUT1, SUBID1, DEADLINE1);
tom_declare(&t2, TOM_NON_CYCLIC, TOM_SET_ENABLE, TIMEOUT2, SUBID2, DEADLINE2);
tom_declare(&t3, TOM_CYCLIC, TOM_SET_DISABLE, TIMEOUT3, SUBID3, DEADLINE3);
tom_set_action(&t3, another_alarm);

3. /* insertion */

tom_insert(tom, &t1), tom_insert(tom, &t2), tom_insert(tom, &t3);

4. /* control */
tom_enable(tom, &t3);
tom_set_deadline(&t2, NEW_DEADLINE2);
tom_renew(tom, &t2);
tom_delete(tom, &t1);

5. /* deactivation */
tom_close(tom);

Table 2: Usage of the TOM class. In 1. a time-out list pointer and three time-out
objects are declared, together with two alarm functions. In 2. the time-out list and
the time-outs are initialized, and a new alarm is associated to time-out t3. Insertion
is carried out at point 3. At 4. t3 is enabled and a new deadline value is specified
for t2. The latter is renewed and t1 is deleted. The list is finally deactivated in 5.


study describing a software system built with our tool. Our conclusions are drawn
in Sect. 5.

2 Time-out Management System 

This section briefly describes the architecture of our time-out management system
(TOM). The TOM class appears to the user as a couple of new types and a library
of functions. Table 2 gives an idea of the client-side protocol of our tool.

To declare a time-out manager, the user needs to define a pointer to a TOM
object and then call function tom_init. The argument to this function is an alarm,
i.e., the function to be called when a time-out expires:

int alarm(TOM *); tom = tom_init( alarm );

The first time function tom_init is called, a custom thread is spawned. That thread
is the actual time-out manager.

Now it is possible to define time-outs. This is done via type timeout_t and
function tom_declare; an example follows:

timeout_t t; tom_declare(&t, TOM_CYCLIC, TOM_SET_ENABLE,
TID, TSUBID, DEADLINE);

In the above, time-out t is declared as: 

• A cyclic time-out (renewed on expiration; as opposed to TOM_NON_CYCLIC,
which means “removed on expiration”),

• enabled (only enabled time-outs “fire”, i.e., call their alarm on expiration; an
alarm is disabled with TOM_SET_DISABLE),

• with a deadline of DEADLINE local clock ticks before expiration.

A time-out t is identified by a pair of integers, TID and TSUBID in the
above example. This is done because in our experience it is often useful to
distinguish instances of classes of time-outs. We then use TID for the class identifier and
TSUBID for the particular instance. A practical example of this is given in Sect. 4.

Once defined, a time-out can be submitted to the time-out manager for insertion
in its running list of time-outs; see [10] for further details on this. From the user's
point of view, this is managed by calling function

tom_insert( TOM *, timeout_t * ).

Note that a time-out might be submitted to more than one time-out manager.

After successful insertion, an enabled time-out will trigger the call of the
default alarm function after the specified deadline. If that time-out is declared as
TOM_CYCLIC, it is then re-inserted.

Other control functions are available: a time-out can be temporarily suspended
while in the time-out list via function

tom_disable( TOM *, timeout_t * )

and (re-)enabled via function

tom_enable( TOM *, timeout_t * ).

Furthermore, the user can specify a new alarm function via tom_set_action
and a new deadline via tom_set_deadline; can delete a time-out from the list
via

tom_delete( TOM *, timeout_t * ),

and renew¹ it via

tom_renew( TOM *, timeout_t * ).

Finally, when the time-out management service is no longer needed, the user
should call function

tom_close( TOM * ),

which also halts the time-out manager thread should no other client be still active.

2.1 System assumptions, building blocks, and algorithms 

This section is to provide the reader with a clear definition of 

• the system assumptions our tool builds upon, 

• the architectural building blocks of our system, 

• the algorithms managing the list of time-outs. 

2.1.1 System assumptions 

Our tool is built in C for a generic Unix-like system with threads and standard
inter-process communication facilities. Two implementations exist to date: one based
on Embedded Parix [11], the other using the standard POSIX threads library [12]. A
fundamental requirement of our model is that processes must have access to some
local physical clock giving them the ability to measure time. The availability of
means to control the priorities of processes is also an important factor in reducing
the chances of late alarm execution. We also assume that the alarm functions are
small grained both in CPU and I/O usage so as not to interfere “too much” with
the tasks of the TOM. Finally, we assume the availability of asynchronous,
non-blocking primitives to send and receive messages.

2.1.2 Architectural building blocks 

Figure 1 portrays the architecture of our time-out manager:

(1) The client process sends requests to the time-out list manager.

(2) The time-out list manager accordingly updates the time-out list with the
server-side protocol described in Sect. 2.1.3.

(3) Each time a time-out reaches its deadline, a request for execution of the
corresponding alarm is sent to a task called alarm scheduler.

(4) The alarm scheduler allocates an alarm request to the first available process
out of those in a circular list of alarm processes, possibly waiting until one
of them becomes available.

¹ Renewing a time-out means removing and re-inserting it.


Figure 2 shows the sequence diagram corresponding to the initialization of the
system and the management of the first time-out request.

The presence of an alarm scheduler and of the circular list of alarm processes
can have great consequences on performance and on the ability of our system to
fulfil real-time requirements. Such aspects have been studied in [10]. Our system
may also operate in a simpler mode, without the two above-mentioned components
and with the time-out list manager taking care of the execution of the alarms.
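
The allocation policy of the alarm scheduler can be illustrated with a minimal sketch. It assumes a fixed pool of alarm processes represented only by a busy flag; the names scheduler_t, sched_allocate and sched_release are hypothetical and are not part of the TOM library. Where the real scheduler would wait for a process to become available, this sketch simply returns -1.

```c
#include <assert.h>
#include <stdbool.h>

#define NWORKERS 4              /* size of the circular list (assumption) */

typedef struct {
    bool busy[NWORKERS];        /* busy[w] is true while alarm process w runs */
    int  next;                  /* where the next circular scan starts */
} scheduler_t;

/* Scan the circular list from `next` and claim the first idle process.
 * Returns its index, or -1 when every process is busy. */
static int sched_allocate(scheduler_t *s)
{
    for (int k = 0; k < NWORKERS; k++) {
        int w = (s->next + k) % NWORKERS;
        if (!s->busy[w]) {
            s->busy[w] = true;
            s->next = (w + 1) % NWORKERS;
            return w;
        }
    }
    return -1;
}

/* Mark an alarm process as available again. */
static void sched_release(scheduler_t *s, int w)
{
    assert(w >= 0 && w < NWORKERS);
    s->busy[w] = false;
}
```

Starting each scan where the previous one ended spreads the alarms over the whole pool instead of always reusing the first process.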


2.1.3 Algorithms 

The server-side protocol is run by a component called time-out list manager (TLM).
The TLM implements a well-known time-out queuing strategy that is described,
e.g., in [13]. Basically, the TLM checks every TM_CYCLE for the occurrence of one of
these two events:

• A request from a client has arrived. If so, TLM serves that request.

• One or more time-outs have expired. If so, TLM executes the corresponding
alarms.

Each time-out t is characterized by its deadline t.deadline, a positive integer
representing the number of clock units that must separate the time of insertion or
renewal from the scheduled time of alarm execution. This field can only be set by
functions tom_declare and tom_set_deadline. Each time-out t also holds a field,
t.running, initially set to t.deadline.

Each time-out list object, say tom, hosts a variable representing the origin of
the time axis. This variable, tom.start_time, regards in particular the time-out at
the top of the time-out list: the idea is that the top of the list is the only entry
whose running field needs to be compared with current time in order to verify the
occurrence of the time-out-expired event. For the time-outs behind the top one, that
field represents relative values, viz., distances from the expiration time of the closest
preceding time-out. In other words, the overall time-out list management aims at
isolating a “closest to expiration” time-out, or head time-out, that is the one and
only time-out to be tracked for expiration, and at keeping track of a list of “relative
time-outs.”

Let us call TimeNow the system function returning the current value of the
clock register. In an ordered, coherent time-out list, the residual time for the head
time-out t is given by

$$ t.\text{running} - (\text{TimeNow} - \text{tom.start\_time}), \qquad (1) $$

that is, the running time minus the time already elapsed. Let us call quantity (1) $r_1$, or
head residual. For time-out $n$, $n > 1$, that is, for the time-out located $n - 1$ entries
“after” the top block, let us define

$$ r_n = r_1 + \sum_{i=2}^{n} t_i.\text{running} \qquad (2) $$

as the n-th residual, or residual time for the time-out $t_n$ at entry n. If there are m entries
in the time-out list, let us define $r_j = 0$ for any $j > m$.

It is now possible to formally define the key operations on a time-out list:
insertion and deletion of an entry.

Insertion. Three cases are possible, namely insertion on top, in the middle, and
at the end of the list.

Insertion on top. In this case we need to insert a new time-out object, say t, such
that $t.\text{deadline} < r_1$, i.e., whose deadline is less than the head residual. Let
us call u the current top of the list. Then the following operations need to be
carried out:

$$ \begin{cases} t.\text{running} \leftarrow t.\text{deadline} + \text{TimeNow} - \text{tom.start\_time} \\ u.\text{running} \leftarrow r_1 - t.\text{deadline}. \end{cases} $$

Note that the first operation is needed in order to satisfy relation

$$ t.\text{running} - (\text{TimeNow} - \text{tom.start\_time}) = t.\text{deadline}, $$

while the second operation aims at turning the absolute value kept in the
running field of the “old” head of the list into a value relative to the one
stored in the corresponding field of the “new” top of the list.

Insertion in the middle. In this case we need to insert a time-out t such that

$$ \exists j : r_j \le t.\text{deadline} < r_{j+1}. $$

Let us call u time-out j + 1. (Note that both t and u exist by hypothesis.)
Then the following operations need to be carried out:

$$ \begin{cases} t.\text{running} \leftarrow t.\text{deadline} - r_j \\ u.\text{running} \leftarrow u.\text{running} - t.\text{running}. \end{cases} $$

Observation 1. Note how, both in the case of insertion on top and in that of
insertion in the middle of the list, time interval $[0, r_m]$ has not changed its
length: it has only been further subdivided, and is now to be referred to as
$[0, r_{m+1}]$.

Insertion at the end. Let us suppose the time-out list consists of m > 0 items, and
that we need to insert time-out t such that $t.\text{deadline} \ge r_m$. In this case we
simply append the item and initialize it so that

$$ t.\text{running} \leftarrow t.\text{deadline} - r_m. $$

Observation 2. Note how insertion at the end of the list is the only way to prolong
the range of action from a certain $[0, r_m]$ to a larger $[0, r_{m+1}]$.
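
The three insertion cases can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the TOM implementation: the list is a fixed-size array, the current time is passed in as `now` instead of being read from TimeNow, and the names tlist_t, tl_residual and tl_insert are hypothetical. Ties are resolved by placing the new time-out after entries with the same residual.

```c
#include <assert.h>
#include <stddef.h>

#define TL_MAX 16                 /* capacity of this toy list (assumption) */

typedef struct {
    long   start_time;            /* origin of the time axis (tom.start_time) */
    size_t len;                   /* number of queued time-outs */
    long   running[TL_MAX];       /* [0] counts from start_time, others are relative */
    int    id[TL_MAX];
} tlist_t;

/* n-th residual r_n (n is 1-based) at time `now`; 0 past the end of the list. */
static long tl_residual(const tlist_t *l, size_t n, long now)
{
    if (n == 0 || n > l->len) return 0;
    long r = l->running[0] - (now - l->start_time);      /* r_1, relation (1) */
    for (size_t i = 1; i < n; i++) r += l->running[i];   /* relation (2) */
    return r;
}

/* Insert a time-out and return its 0-based position in the list. */
static size_t tl_insert(tlist_t *l, int id, long deadline, long now)
{
    assert(l->len < TL_MAX && deadline > 0);
    if (l->len == 0) l->start_time = now;    /* first element sets the origin */

    size_t pos = 0;
    long r_prev = 0;                         /* residual of the predecessor */
    long r_cur = tl_residual(l, 1, now);     /* residual of the entry at pos */
    while (pos < l->len && deadline >= r_cur) {
        r_prev = r_cur;
        pos++;
        if (pos < l->len) r_cur += l->running[pos];
    }
    for (size_t i = l->len; i > pos; i--) {  /* make room at pos */
        l->running[i] = l->running[i - 1];
        l->id[i] = l->id[i - 1];
    }
    l->len++;
    if (pos == 0) {                          /* insertion on top */
        l->running[0] = deadline + (now - l->start_time);
        if (l->len > 1) l->running[1] = r_cur - deadline;
    } else {                                 /* in the middle or at the end */
        l->running[pos] = deadline - r_prev;
        if (pos + 1 < l->len) l->running[pos + 1] -= l->running[pos];
    }
    l->id[pos] = id;
    return pos;
}
```

Inserting a 330-tick time-out at time 0 and a 400-tick one at time 100 leaves the second entry with a relative value of 170, because the head residual at that moment is 230.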

Deletion. The other basic management operation on the time-out list is deletion.
As we had three possible insertions, likewise we distinguish here deletion from
top, from the middle, and from the end of the list.

Deletion from top. If the list is a singleton we are in a trivial case. Let us suppose
there are at least two items in the list. Let us call t the top of the list and u
the next element, the one that will be promoted to top of the list. From its
definition we know that

$$ \begin{aligned} r_2 &= u.\text{running} + r_1 \\ &= u.\text{running} + t.\text{running} - (\text{TimeNow} - \text{tom.start\_time}). \qquad (3) \end{aligned} $$

By (1), the bracketed quantity is the elapsed time. Then the number of
time units that separate the origin of the time axis, tom.start_time, from the
expiration time of u is given by $u.\text{running} + t.\text{running}$. In order to
“behead” the list we therefore need to update u as follows:

$$ u.\text{running} \leftarrow u.\text{running} + t.\text{running}. $$

Deletion from the middle. Let us say we have two consecutive time-outs in our
list, t followed by u, such that t is not the top of the list. With a reasoning
similar to the one just followed we get to the same conclusion: before
physically purging t off the list we need to perform the following step:

$$ u.\text{running} \leftarrow u.\text{running} + t.\text{running}. $$

Deletion from the end. Deletion from the end means deleting an entry which is not
referenced by any further item in the list. Physical deletion can be performed
with no need for updating. Only, the interval of action is shortened.

Observation 3. Variable tom.start_time is never set when deleting from or
inserting entries into a time-out list, except when inserting the first element: in such case,
that variable is set to the current value of TimeNow.
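
A matching sketch for deletion follows, again under assumptions: an array-based list, hypothetical names (tlist_t, tl_delete, tl_residual), and no claim to mirror the actual TOM code. All three cases reduce to one rule: if the deleted entry has a successor, the successor absorbs the deleted running value; the time origin is left untouched.

```c
#include <assert.h>
#include <stddef.h>

#define TL_MAX 16                 /* capacity of this toy list (assumption) */

typedef struct {
    long   start_time;            /* origin of the time axis, never changed here */
    size_t len;
    long   running[TL_MAX];       /* [0] counts from start_time, others are relative */
} tlist_t;

/* n-th residual (1-based) at time `now`; 0 past the end of the list. */
static long tl_residual(const tlist_t *l, size_t n, long now)
{
    if (n == 0 || n > l->len) return 0;
    long r = l->running[0] - (now - l->start_time);
    for (size_t i = 1; i < n; i++) r += l->running[i];
    return r;
}

/* Delete the entry at 0-based position pos (top, middle or end). */
static void tl_delete(tlist_t *l, size_t pos)
{
    assert(pos < l->len);
    if (pos + 1 < l->len)                        /* u.running += t.running */
        l->running[pos + 1] += l->running[pos];
    for (size_t i = pos; i + 1 < l->len; i++)    /* close the gap */
        l->running[i] = l->running[i + 1];
    l->len--;
}
```

The residuals of the surviving entries are unchanged by a deletion, which is the invariant the update rule is meant to preserve.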

Figure 3 shows the action of the server-side protocol. In 1., a 330ms time-out
called A is inserted in the list. In 2., after 100ms, A has been reduced to 230ms
and a 400ms time-out, called B, is inserted (its value is 170ms, i.e., 400-230ms).
Another 70ms have passed in 3., so A has been reduced to 160ms. At that point,
a 510ms time-out, C, is inserted and goes to the third position. In 4., after 160ms,
time-out A occurs; B then becomes the top of the list and its decrementation starts.
In 5. another 20ms have passed and B is at 150ms; at that point a 230ms
time-out, called D, is inserted. Its position is in between B and C, therefore the latter is
adjusted. In 6., after 150ms, B occurs and D goes on top.

3 Discussion 

In this section we show that the syntactical constructs in Table 1 can be expressed
in terms of our class of time-outs. We do so by considering three classical failure
detectors and providing their time-out based specifications.

Let us consider the classical formulation of the eventually perfect failure detector
◊P [1]. The main idea of the protocol is to require each task to send a “heartbeat”
to its fellows and maintain a list of tasks suspected to have failed. A task identifier
q enters the list of task p if no heartbeat is received by p during a certain amount
of time, Δp(q), initially set to a default value. This value is increased when late
heartbeats are received.

The basic structure of ◊P is that of a coroutine with three concurrent processes,
two of which execute a task periodically while the third one is triggered by the
arrival of a message:
Every process p executes the following:

    output_p ← ∅
    for all q ∈ Π
        Δp(q) ← default time interval

    cobegin

    - Task 1: repeat periodically
          send “p-is-alive” to all

    - Task 2: repeat periodically
          for all q ∈ Π
              if q ∉ output_p and p did not receive “q-is-alive” during
              the last Δp(q) ticks of p's clock then
                  output_p ← output_p ∪ {q}

    - Task 3: when received “q-is-alive” for some q
          if q ∈ output_p
              output_p ← output_p − {q}

    coend.

We call the repeat periodically in Task 1 a “multiplicity 1” repeat, because 
indeed a single action (sending a “p-is-alive” message) has to be tracked, while we 
call “multiplicity q” repeat the one in Task 2, which requires to check q events. 

Our reformulation of the above code is as follows: 

Every process p executes the following:

    timeout_t t_task1, t_task2[NPROCS];
    task_t p, q;

    for (q=0; q<NPROCS; q++) {
        Δp[q] = DEFAULT_TIMEOUT;
        output_p[q] = TRUST;
    }

    /* & is the “address-of” operator */

    tom_declare(&t_task1, TOM_CYCLIC, TOM_SET_ENABLE, p, 0, Δp[q]);
    tom_set_action(&t_task1, action_Repeat_Task1);
    tom_insert(&t_task1);

    for (q=0; q<NPROCS; q++) {
        if (p != q) {
            tom_declare(t_task2 + q, TOM_CYCLIC, TOM_SET_ENABLE, q, 0, Δp[q]);
            tom_set_action(t_task2 + q, action_Repeat_Task2);
            tom_insert(t_task2 + q);
        }
    }

    do {
        getMessage(&m);
        switch (m.type) {
            TASK1;
            TASK2;
            TASK3;
        }
    } forever;

where tasks and actions are defined as follows:

    TASK1 ≡ case REPEAT_TASK1:
                sendAll(I_AM_ALIVE);
                break;

    TASK2 ≡ case REPEAT_TASK2:
                q = m.id;
                if (output_p[q] == TRUST)
                    output_p[q] = SUSPECT;
                break;

    TASK3 ≡ case I_AM_ALIVE:
                q = m.sender;
                if (output_p[q] == SUSPECT) {
                    output_p[q] = TRUST;
                    Δp[q] = Δp[q] + 1;
                }
                break;

    action_Repeat_Task1() {
        message_t m;
        m.type = REPEAT_TASK1;
        Send(m, p);
    }

    action_Repeat_Task2(timeout_t *t) {
        message_t m;
        m.type = REPEAT_TASK2;
        m.id = t->id;
        Send(m, p);
    }

We can draw the following observations: 

• Our syntax is less abstract than the one adopted in the classical formulation.
Indeed we have deliberately chosen a syntax very similar to that of
programming languages such as C or C++. Behind the lines, we also assume a similar
semantics.

• Our syntax is more strongly typed: we have deliberately chosen to define
(most of) the objects our code deals with.

• We have systematically avoided set-wise operations such as union,
complement or membership by translating sets into arrays as, e.g., in

output_p ← output_p ∪ {q},

which we changed into

output_p[q] = SUSPECT.

• We have systematically rewritten the abstract construct repeat
periodically as one or more time-outs (depending on its multiplicity). Each
of these time-outs has an associated action that sends one message to the
client process, p. This means that

1. time-related event “it's time to send p-is-alive to all” becomes event
“message REPEAT_TASK1 has arrived.”

2. time-related event “it's time to check whether q-is-alive has arrived”
becomes event “message (REPEAT_TASK2, id=q) has arrived.”

• Due to the now homogeneous nature of the possible events (which are now all
represented by message arrivals) a single process may manage those events
through a multiple selection statement (a switch). In other words, no
coroutine is needed anymore.
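
The last observation can be made concrete with a self-contained sketch of the single dispatch function. It is an illustration under assumptions, not the code generated from the reformulation: messages are plain structs handed to the handler by the caller, sending “I am alive” to all is reduced to a counter, and the names detector_t, detector_init and detector_handle are hypothetical.

```c
#include <assert.h>

#define NPROCS 4                          /* number of processes (assumption) */

enum { TRUST, SUSPECT };
enum msg_type { REPEAT_TASK1, REPEAT_TASK2, I_AM_ALIVE };

typedef struct {
    enum msg_type type;
    int id;                               /* used by REPEAT_TASK2 */
    int sender;                           /* used by I_AM_ALIVE */
} message_t;

typedef struct {
    int  output[NPROCS];                  /* TRUST or SUSPECT for each peer */
    long delta[NPROCS];                   /* per-peer time-out value */
    int  heartbeats_sent;                 /* stands in for sendAll(I_AM_ALIVE) */
} detector_t;

static void detector_init(detector_t *d, long default_timeout)
{
    d->heartbeats_sent = 0;
    for (int q = 0; q < NPROCS; q++) {
        d->output[q] = TRUST;
        d->delta[q] = default_timeout;
    }
}

/* One switch handles both timer-generated and network messages. */
static void detector_handle(detector_t *d, const message_t *m)
{
    switch (m->type) {
    case REPEAT_TASK1:                    /* it is time to send a heartbeat */
        d->heartbeats_sent++;
        break;
    case REPEAT_TASK2:                    /* the time-out for peer m->id fired */
        assert(m->id >= 0 && m->id < NPROCS);
        if (d->output[m->id] == TRUST)
            d->output[m->id] = SUSPECT;
        break;
    case I_AM_ALIVE:                      /* a heartbeat arrived from m->sender */
        assert(m->sender >= 0 && m->sender < NPROCS);
        if (d->output[m->sender] == SUSPECT) {
            d->output[m->sender] = TRUST;
            d->delta[m->sender] += 1;     /* be more patient with this peer */
        }
        break;
    }
}
```

Because timer expirations arrive as ordinary messages, the handler needs no threads or coroutines and can be exercised by feeding it a scripted message sequence.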
Through the Literate Programming approach and a compliant tool such as
CWEB [14, 9] it is possible to further improve our reformulation. As is well known,
the CWEB tool allows a pretty-printable TeX documentation and a C file ready for
compilation and testing to be produced from a single source code. In our
experience this link between these two contexts can be very beneficial: testing or even
simply using the code provides feedback on the specification of the algorithm,
while the improved specification may reduce the probability of design faults and in
general increase the quality of the code.

Figure 4 and Figure 5 respectively show a reformulation of the HB failure
detector for partitionable networks [2] and of the group membership failure
detector [16] produced with CWEB. In those reformulations, symbols such as τ and 𝒟p
are caught by CWEB and translated into legal C tokens via its “@f” construct [14].
Note also that the expression m.path[q] ≤ PRESENT in Fig. 4 means “q appears
at most once in path”. A full description of these protocols is out of the scope
of this paper; for that we refer the reader to the above cited articles. The focus
here is mainly on the syntactical constructs used in them and on our reformulations,
which include simple translations of the syntactical constructs in Table 1 in terms
of our time-out API. A case worth noting is that of the group membership failure
detector: here the authors mimic the availability of a cyclic time-out service but
mix its management into their formulation. This management code can be avoided
altogether using our approach.

4 A development experience: the DIR net 

What we call “DIR net” [15] is the distributed application at the core of the software
fault tolerance strategy realized through several European projects [15]. In this
section we describe the DIR net and report on how we designed and developed it
by means of the TOM system.

The DIR net is a fault-tolerant network of failure detectors connected to other
peripheral error detectors (called “Dtools” in what follows). The objective of the DIR
net is to ensure consistent fault tolerance strategies throughout the system and to play
the role of a backbone handling information to and from the Dtools [15].

The DIR net consists of four classes of components. Each processing node in
the system runs an instance of a so-called “I'm Alive Task” (IAT) plus an instance
of either a “DIR Manager” (DIR-M), or a “DIR Agent” (DIR-A), or a “DIR Backup
Agent” (DIR-B). A DIR-A gathers all error detection messages produced by the
Dtools on the current processing node and forwards them to the DIR-M and the
DIR-B's. A DIR-B is a DIR-A which also maintains its messages in a database
located in central memory. It is connected to the DIR-M and to the other DIR-B's and
Time-out   Caller  Action                                                   Cyclic?
---------  ------  -------------------------------------------------------  -------
t_IA_SET   DIR-x   On TimeNow + d_IA_SET do send m_IA_SET_ALARM to Caller   Yes
t_IA_CLR   IAT     On TimeNow + d_IA_CLR do send m_IA_CLR_ALARM to IAT      Yes


Table 3: Description of time-outs t_IA_SET and t_IA_CLR.


Message         Receiver  Explanation        Action
--------------  --------  -----------------  ---------------------------------------
m_IA_SET_ALARM  DIR-x     Time to set IAF    IAF ← TRUE
m_IA_CLR_ALARM  IAT k     Time to check IAF  if (IAF = FALSE) SendAll(m_TEIF, k)
                                             else IAF ← FALSE


Table 4: Description of messages m_IA_SET_ALARM and m_IA_CLR_ALARM.


is eligible for election as a DIR-M. A DIR-M is a special case of DIR-B. Unique
within the system, the DIR-M is the one component responsible for running error
recovery strategies; see [15] for a description of the latter. Let us use DIR-x to
address any non-IAT component (i.e., the DIR-M, or a DIR-B, or a DIR-A).

An important design goal of the DIR net is that of being tolerant to physical
and design faults, both permanent and intermittent, affecting up to all but one DIR-B.
This is accomplished also through a failure detection protocol that we are going to
describe in the rest of this section.

4.1 The DIR net failure detection protocol 

Our protocol consists of a local part and a distributed part. Each of them is realized 
through our TOM class. 

4.1.1 DIR net protocol: local component 

As we already mentioned, each processing node hosts a DIR-x and an IAT. These
two components run a simple algorithm: they share a local Boolean variable, the
“I'm Alive Flag” (IAF). The DIR-x has to periodically set the IAF to TRUE while
the IAT has to check periodically that this has indeed occurred and reset the IAF to
FALSE. If the IAT finds the IAF set to FALSE it broadcasts message m_TEIF (“this
entity is faulty”).

The cyclic tasks mentioned above can be easily modeled via two time-outs,
t_IA_SET and t_IA_CLR, described in Table 3 and Table 4 (TimeNow being the system
function returning the current value of the clock register).

Note that the time-outs' alarm functions do not clear or set the flag themselves: if they did,
a hung DIR-x would go undetected. Instead, those functions trigger the
transmission of messages that, once received by healthy components, trigger the
execution of the intended actions.
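
The interplay between the two alarm messages and the flag can be sketched as follows. This is a simplified model under assumptions: both components of a node are folded into one struct, a hung DIR-x is modelled by a Boolean, and broadcasting the TEIF message is reduced to a counter. The names node_t and node_deliver are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { M_IA_SET_ALARM, M_IA_CLR_ALARM } alarm_msg_t;

typedef struct {
    bool iaf;                 /* the shared "I'm Alive Flag" */
    bool dirx_healthy;        /* false models a hung DIR-x (assumption) */
    int  teif_broadcasts;     /* stands in for SendAll(m_TEIF, k) */
} node_t;

/* Deliver one alarm message to its receiver on this node. */
static void node_deliver(node_t *n, alarm_msg_t m)
{
    assert(n != NULL);
    switch (m) {
    case M_IA_SET_ALARM:              /* receiver: DIR-x */
        if (n->dirx_healthy)          /* a hung DIR-x never handles the message */
            n->iaf = true;
        break;
    case M_IA_CLR_ALARM:              /* receiver: IAT */
        if (n->iaf)
            n->iaf = false;           /* flag was set in time: clear it again */
        else
            n->teif_broadcasts++;     /* flag not set: report a faulty entity */
        break;
    }
}
```

The flag is only ever set by the code path that runs inside the DIR-x, so a DIR-x that stops processing messages is reported at the next check even though its time-out keeps firing.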

The following is a pseudo-code for the IAT algorithm:

The IAT k executes as follows:

    timeout_t t_IA_CLR;
    msg_t activationMessage, m;

    tom_declare(&t_IA_CLR, TOM_CYCLIC,
                TOM_SET_ENABLE, IAT_CLEAR_TIMEOUT, 0, d_IA_CLR);
    tom_set_action(&t_IA_CLR, actionSend_m_IA_CLR_ALARM);
    tom_insert(&t_IA_CLR);

    Receive(activationMessage);

    forever {
        Receive(m);
        if (m.type == m_IA_CLR_ALARM) {
            if (IAF == TRUE) IAF ← FALSE;
            else { SendAll(m_TEIF, k); delete_timeout(&t_IA_CLR); }
        }
    }

    actionSend_m_IA_CLR_ALARM() { Send(m_IA_CLR_ALARM, IAT k); }

The time-out formulation of the IAT algorithm is given in next section.

4.1.2 DIR net protocol: distributed component 

The resilience of the DIR net to crash faults comes from the DIR-M and the
DIR-B's running the following distributed failure detection algorithm:

Algorithm DIR-M. Let us call mid the node hosting the DIR-M and b the
number of processing nodes that host a DIR-B. The DIR-M has to cyclically send a
m_MIA (“Manager-Is-Alive”) message to all the DIR-B's each time time-out t_MIA_A
expires; this is shown in the right side of Fig. 6. Obviously this is a multiplicity
b “repeat” construct, which can be easily managed through a cyclic time-out with
an action that signals that a new cycle has begun. In this case the action is “send a
message of type m_MIA_A_ALARM to the DIR-M.”

The manager also periodically expects a (m_TAIA, i) message
(“This-Agent-Is-Alive”) from each node i where a DIR-B is expected to be running. This is easily
accomplished through a vector of (t_TAIA_A, i) time-outs. The left part of Fig. 6 shows
this for node i. When time-out (t_TAIA_A, p) expires it means that no (m_TAIA, p)
message has been received within the current period. In this case the DIR-M enters
what we call a “suspicion period”. During such a period the manager needs to
distinguish the case of a late DIR-B from a crashed one. This is done by inserting a
non-cyclic time-out, namely (t_TEIF_A, p).

During the suspicion period only one out of the following three events may
occur:

1. A late (m_TAIA, p) is received.

2. A (m_TEIF, p) from the IAT at node p is received.

3. Nothing comes in and the time-out expires.

In case 1. we get out of the suspicion period, conclude that the DIR-B at node p
was simply late and go back to waiting for the next (m_TAIA, p).

It is the responsibility of the user to choose meaningful values for the time-outs'
deadlines. By “meaningful” we mean that those values should match the
characteristics of the environment and represent a good trade-off between the following
two risks:

overshooting, i.e., choosing too large values for the deadlines. This decreases the
probability of false negatives (regarding a slow process as a failed process;
this is known as accuracy in failure detection terminology) but increases the
detection latency;

undershooting, namely under-dimensioning the deadlines. This may considerably
increase false negatives but reduces the detection latency of failed
processes.

Under the hypotheses of properly chosen time-out deadlines and of a
single, stable environment², the occurrence of late (m_TAIA, p) messages should
be exceptional. Such an event would translate into a false deduction that is uncovered in the
next cycle. Further late messages would postpone a correct assessment, but are
considered an unlikely situation given the above hypotheses. An alternative and
better approach would be to track the changes in the environment. For the case at
hand this would mean that the time-outs' deadlines should be adaptively adjusted.
This could be possible, e.g., through an approach such as in [19].

² We call an environment “stable” when it does not drastically change its characteristics except
under erroneous and exceptional conditions. Single environments are typical of fixed (non-mobile)
applications.
If 2. is the case we assume the remote component has crashed though its node is
still working properly, as the IAT on that node still gives signs of life. Consequently
we initiate an error recovery step. This includes sending a “WAKEUP” message to
the remote IAT so that it spawns another DIR-B on that node.

In case 3. we assume the entire node has crashed and initiate node recovery.

The underlying assumption of our algorithm is that the IAT is so simple that if it
fails then we can assume the whole node has failed.
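
The manager's view of one remote node during and outside the suspicion period can be summarized as a small state machine. The sketch below is an illustration under assumptions: it tracks a single node, it returns to the normal state after either recovery action (the text does not specify this), and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ST_NORMAL, ST_SUSPICION } state_t;

typedef enum {
    EV_TAIA,            /* (m_TAIA, p) received from the DIR-B at node p */
    EV_TAIA_ALARM,      /* time-out (t_TAIA_A, p) expired */
    EV_TEIF,            /* (m_TEIF, p) received from the IAT at node p */
    EV_TEIF_ALARM       /* time-out (t_TEIF_A, p) expired */
} event_t;

typedef enum {
    ACT_NONE,
    ACT_START_SUSPICION,     /* insert the non-cyclic time-out (t_TEIF_A, p) */
    ACT_RECOVER_COMPONENT,   /* case 2: node alive, DIR-B crashed */
    ACT_RECOVER_NODE         /* case 3: whole node assumed crashed */
} action_t;

/* Advance the per-node state and report what the manager should do. */
static action_t manager_step(state_t *s, event_t ev)
{
    assert(s != NULL);
    if (*s == ST_NORMAL) {
        if (ev == EV_TAIA_ALARM) {
            *s = ST_SUSPICION;
            return ACT_START_SUSPICION;
        }
        return ACT_NONE;
    }
    /* suspicion period: exactly one of three outcomes ends it */
    switch (ev) {
    case EV_TAIA:       *s = ST_NORMAL; return ACT_NONE;   /* case 1: it was just late */
    case EV_TEIF:       *s = ST_NORMAL; return ACT_RECOVER_COMPONENT;
    case EV_TEIF_ALARM: *s = ST_NORMAL; return ACT_RECOVER_NODE;
    default:            return ACT_NONE;
    }
}
```

In a fuller version the manager would keep one such state per node, indexed like the vector of (t_TAIA_A, i) time-outs.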


Algorithm DIR-B This algorithm is also divided into two concurrent tasks. In
the first one, DIR-B on node i has to cyclically send (m_TAIA, i) messages to
the manager, either by piggybacking or when time-out t_TAIA_B expires. This is
represented in the right side of Fig. 7.

The DIR-B's in turn periodically expect an m_MIA message from the DIR-M. As is
evident when comparing Fig. 6 with Fig. 7, the DIR-B algorithm is very similar
to the one of the manager: DIR-B also enters a suspicion period when its
manager does not appear to respond quickly enough. This period is managed via
time-out t_TEIF_B, in the same way as in DIR-M. Also in this case we can get
out of this state in one out of three possible ways: either

1. a late (m_MIA_B_ALARM, mid) is received, or

2. a (m_TEIF, mid) sent by the IAT at node mid is received, or

3. nothing comes in and the time-out expires.


In case 1. we get out of the suspicion period, conclude that the manager was
simply late, go back to the normal state and start waiting for the next
(m_MIA, mid) message. Also in this case, a wrong deduction shall be detected in
the next cycles. In case 2. we conclude the manager has crashed though its node
is still working properly, as its IAT acted as expected. Consequently we
initiate a manager recovery phase structured similarly to the DIR-B recovery
step described in Sect. 4.1.2. In case 3. we assume the node of the manager has
crashed, elect a new manager among the DIR-B's, and perform a node recovery
phase.
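
The three ways out of the suspicion period can be read as a small state
transition function. The sketch below is only an illustration of that reading:
the state and event names are hypothetical, and the actual recovery and
election steps are reduced to the state that the backup would enter.

```c
#include <assert.h>

/* Hypothetical states of a DIR-B with respect to its manager. */
typedef enum {
    ST_NORMAL,            /* waiting for the next heartbeat of the manager */
    ST_SUSPECTING,        /* manager is late, suspicion time-out is running */
    ST_MANAGER_RECOVERY,  /* manager crashed, its node is alive */
    ST_NODE_RECOVERY      /* node of the manager crashed, election needed */
} backup_state_t;

/* Hypothetical events that may end a suspicion period. */
typedef enum {
    EV_LATE_MIA,       /* case 1: the late manager message arrives */
    EV_TEIF_FROM_IAT,  /* case 2: the IAT on the manager node reports */
    EV_TEIF_TIMEOUT    /* case 3: nothing arrives, the time-out expires */
} suspicion_event_t;

/* Return the state entered when event ev ends the suspicion period. */
backup_state_t resolve_suspicion(backup_state_t state, suspicion_event_t ev)
{
    assert(state == ST_SUSPECTING);
    switch (ev) {
    case EV_LATE_MIA:      return ST_NORMAL;
    case EV_TEIF_FROM_IAT: return ST_MANAGER_RECOVERY;
    case EV_TEIF_TIMEOUT:  return ST_NODE_RECOVERY;
    }
    return state; /* unknown event: stay in the suspicion period */
}
```

In a main loop such as the one of Fig. 10, a function of this kind would be
called from the switch on the message type while the suspicion time-out is
pending.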

Table 5 summarizes the DIR-M and DIR-B algorithms.

We have developed the DIR net using the Windows TIRAN libraries l(T^ and 
the CWEB system of structured documentation. 


4.2 Special services 

4.2.1 Configuration 

The management of a large number of time-outs may be an error-prone task. To
simplify it, we designed a simple configuration language. Figure 8 shows an
example of a configuration script that specifies the structure of the DIR net
(in this case, a four node system with three DIR-B's deployed on nodes 1-3 and
the DIR-M on node 0) and its time-outs. A translator produces the C header
files to properly initialize an instance of the DIR net (see Fig. 9).

| Time-out    | Caller  | Action                                                      | Cyclic? |
|-------------|---------|-------------------------------------------------------------|---------|
| t_MIA_A     | DIR-M   | Every d_MIA_A do send m_MIA_A_ALARM to DIR-M                | Yes     |
| t_TAIA_A[i] | DIR-M   | Every d_TAIA_A do send (m_TAIA_A_ALARM, i) to DIR-M         | Yes     |
| t_TEIF_A[i] | DIR-M   | On TimeNow + d_TEIF_A do send (m_TEIF_A_ALARM, i) to DIR-M  | No      |
| t_TAIA_B    | DIR-B j | Every d_TAIA_B do send m_TAIA_B_ALARM to DIR-B j            | Yes     |
| t_MIA_B     | DIR-B j | Every d_MIA_B do send m_MIA_B_ALARM to DIR-B j              | Yes     |
| t_TEIF_B    | DIR-B j | On TimeNow + d_TEIF_B do send m_TEIF_B_ALARM to DIR-B j     | No      |

| Message             | Receiver | Explanation                 | Action                              |
|---------------------|----------|-----------------------------|-------------------------------------|
| (m_TAIA, i)         | DIR-M    | DIR-B i is OK               | (Re-)Insert or renew t_TAIA_A[i]    |
| m_MIA_A_ALARM       | DIR-M    | A new heartbeat is required | Send m_MIA to all DIR-B's           |
| (m_TAIA_A_ALARM, i) | DIR-M    | Possibly DIR-B i is not OK  | Delete t_TAIA_A[i], insert t_TEIF_A[i] |
| (m_TEIF, i)         | DIR-M    | DIR-B i crashed             | Declare DIR-B i crashed             |
| (m_TEIF_A_ALARM, i) | DIR-M    | Node i crashed              | Declare node i crashed              |
| m_MIA               | DIR-B j  | DIR-M is OK                 | Renew t_MIA_B                       |
| m_TAIA_B_ALARM      | DIR-B j  | A new heartbeat is required | Send (m_TAIA, j) to DIR-M           |
| m_MIA_B_ALARM       | DIR-B j  | Possibly DIR-M is not OK    | Delete t_MIA_B, insert t_TEIF_B     |
| m_TEIF              | DIR-B j  | DIR-M crashed               | Declare DIR-M crashed               |
| m_TEIF_B_ALARM      | DIR-B j  | DIR-M's node crashed        | Declare DIR-M's node crashed        |

Table 5: Time-outs and messages of DIR-M and DIR-B.

4.2.2 Fault injection 

Time-outs may also be used to specify fault injection actions with fixed or
pseudo-random deadlines. In the DIR net this is done as follows. First we
define the time-out:

#ifdef INJECT
tom_declare(&inject, TOM_NON_CYCLIC, TOM_SET_ENABLE,
            INJECT_FAULT_TIMEOUT, i, INJECT_FAULT_DEADLINE);
tom_insert(tom, &inject);
#endif

The alarm of this time-out sends the local DIR-x a message of type
“INJECT_FAULT_TIMEOUT”. Figure 10 shows an excerpt from the actual main loop of
the DIR-M in which this message is processed.
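
To make the difference between a non-cyclic time-out (as used above for fault
injection) and a cyclic one concrete, the following toy model may help. It is
not the TOM implementation: every name prefixed with toy_ is hypothetical, time
advances one tick per call, and "sending a message" is reduced to writing the
message type into an output array.

```c
#include <assert.h>

#define TOY_MAX 8  /* capacity of the toy time-out list */

typedef struct {
    int  active;    /* non-zero while the entry is in the list */
    int  cyclic;    /* non-zero: re-armed after each alarm */
    long period;    /* deadline, in ticks, relative to insertion or last alarm */
    long due;       /* absolute tick at which the alarm fires */
    int  msg_type;  /* type of the message "sent" by the alarm */
} toy_timeout_t;

typedef struct {
    toy_timeout_t slot[TOY_MAX];
    long now;       /* current tick */
} toy_tom_t;

void toy_init(toy_tom_t *t)
{
    int i;
    t->now = 0;
    for (i = 0; i < TOY_MAX; i++)
        t->slot[i].active = 0;
}

/* Insert a time-out; return its slot index, or -1 if the list is full. */
int toy_insert(toy_tom_t *t, int cyclic, long period, int msg_type)
{
    int i;
    assert(period > 0);
    for (i = 0; i < TOY_MAX; i++) {
        if (!t->slot[i].active) {
            t->slot[i].active = 1;
            t->slot[i].cyclic = cyclic;
            t->slot[i].period = period;
            t->slot[i].due = t->now + period;
            t->slot[i].msg_type = msg_type;
            return i;
        }
    }
    return -1;
}

/* Advance the clock by one tick. The types of the messages produced by the
   expired alarms are stored in out[] (at most max_out of them, the rest are
   dropped); the number of stored types is returned. */
int toy_tick(toy_tom_t *t, int *out, int max_out)
{
    int i, n = 0;
    t->now++;
    for (i = 0; i < TOY_MAX; i++) {
        toy_timeout_t *e = &t->slot[i];
        if (!e->active || e->due > t->now)
            continue;
        if (n < max_out)
            out[n++] = e->msg_type;
        if (e->cyclic)
            e->due += e->period;  /* cyclic: schedule the next alarm */
        else
            e->active = 0;        /* non-cyclic: fires once, then is removed */
    }
    return n;
}
```

A fault injection time-out corresponds to a non-cyclic entry whose message type
plays the role of INJECT_FAULT_TIMEOUT: it produces exactly one message and
then leaves the list, while a cyclic entry keeps producing alarms.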



4.2.3 Fault tolerance 

A service such as TOM is indeed a single point of failure, in that a failed TOM
in the DIR net would leave all components unable to perform their failure
detection protocols. The other DIR net components could not distinguish such a
case from that of a crashed node. As is well known from, e.g., [20], a single
design fault in TOM's implementation could bring the system to a global
failure. Nevertheless, the isolation of a service for time-out management may
pave the way for a cost-effective adoption of multiple-version software fault
tolerance techniques [21] such as the well-known recovery block [22] or
N-version programming [23]. Another possibility would be to use the DIR net
algorithm to tolerate faults in TOM. No such technique has been adopted in the
current implementation of TOM. Other factors, such as congestion or malicious
attacks, might introduce performance failures that would impact all modules
that depend on
TOM to perform their time-based processing iflOll . 
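
For readers unfamiliar with the recovery block scheme mentioned above, the
following sketch shows its general control structure: alternative versions of
the same computation are tried in order until one passes an acceptance test.
It is a generic illustration and not code from TOM; the integer square root
variants, the acceptance test and all function names are assumptions chosen
only to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*variant_fn)(int input, int *output);   /* returns 0 on completion */
typedef int (*acceptance_fn)(int input, int output); /* non-zero if acceptable */

/* Try the variants in order; return the index of the first one whose result
   passes the acceptance test, or -1 if all of them fail. */
int recovery_block(const variant_fn *variants, size_t n,
                   acceptance_fn accept, int input, int *output)
{
    size_t k;
    assert(variants != NULL && accept != NULL && output != NULL);
    for (k = 0; k < n; k++) {
        int candidate;
        if (variants[k](input, &candidate) == 0 && accept(input, candidate)) {
            *output = candidate;
            return (int)k;
        }
    }
    return -1;
}

/* Primary version with a deliberate design fault: it is right only by chance. */
int isqrt_halving(int input, int *output)
{
    *output = input / 2;
    return 0;
}

/* Alternate version: slow but correct linear search. */
int isqrt_linear(int input, int *output)
{
    int r = 0;
    while ((r + 1) * (r + 1) <= input)
        r++;
    *output = r;
    return 0;
}

/* Acceptance test: output is the integer square root of input. */
int isqrt_accept(int input, int output)
{
    return output >= 0 && output * output <= input
        && (output + 1) * (output + 1) > input;
}
```

With the two variants above, an input of 16 is rejected for the primary version
and accepted for the alternate, while an input of 4 is already accepted for the
primary. N-version programming would instead run all versions and vote on
their results.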

5 Conclusions 

We have introduced a tentative lingua franca for the expression of failure
detection protocols. TOM has the advantages of being simple, elegant and
unambiguous. Many benefits would obviously come from the adoption of a
standard, semi-formal representation in place of the current Babel of informal
descriptions: easier acquisition of insight, faster verification, and a greater
ability to rapid-prototype software systems. The availability of a tool such as
TOM is also one of the requirements of the timed-asynchronous system model
[25].

Given the current lack of a network service for failure detection, the
availability of standard methods to express failure detectors in the
application layer is an important asset: a tool like the one described in this
paper isolates and crystallizes a part of the complexity required to express
failure detection protocols. This complexity may become transparent to the
designer, with tangible savings in terms of development times and costs, if
more effort is devoted to time-out configuration and automatic adjustment
through adaptive approaches such as the one
described in llT^ . Such optimizations will be the subject of future research. Fu¬ 
ture plans also include to port our system to AspectJ 12^ so as to further enhance 
programmability and separation of design concerns. 

As a final remark we would like to point out that at the core of our design
choices is the selection of C and literate programming, which proved to be
invaluable tools to reach our design goals. Nevertheless we must point out that
these choices may turn into intrinsic limitations for the expressiveness of the
resulting language. In particular, they enforce a syntactical and semantic
structure, that of the C programming language, which may be regarded as a
limitation by those researchers who are not accustomed to that language. At the
same time we would like to remark that those very choices allow a
straightforward translation of our constructs into a language like Promela
[26], which closely resembles a C language augmented with Hoare's CSP [27].
Accordingly, our future work in this framework shall include the adoption of
the Promela extension of Prof. Bosnacki, which allows the verification of
concurrent systems that depend on timing parameters [28]. Interestingly enough,
this version of Promela includes new objects, called discrete time countdown
timers, which are basically equivalent to our non-cyclic time-outs. Our goal is
to come up with a tool that generates from the same literate programming source
(1) a pretty printout in TeX, (2) C code ready to be compiled and run, and (3)
Promela code to verify some properties of the protocol.
Figure 1: Architecture of the time-out management system.

Figure 2: Sequence diagram for the tasks of the time-outs manager.

Figure 3: Operating scenario of the time-out manager.

1. Code of the HB failure detector for partitionable networks.
Aguilera, Chen and Toueg, Theoretical Computer Science n.1, 1999.

#define HEARTBEAT 1
#define ITTB 2
#define SOME_PERIOD 100000
#define FOREVER 1
#define PRESENT 1
#define ABSENT 0
⟨Initialisation 2⟩

2. Every process p executes the following:

⟨Initialisation 2⟩ ≡
main()
{
    timeout_t tau_ittb;
    message_t m;

    for (q = 0; q < NPROCS; q++) {
        D_p[q] = 0;
        path[q] = ABSENT;
    }
    tom_declare(&tau_ittb, TOM_CYCLIC, TOM_SET_ENABLE, 1, 1, 1);
    tom_set_action(&tau_ittb, actionItsTimeToBroadcast); /* sends ITTB */
    tom_set_deadline(&tau_ittb, SOME_PERIOD);            /* every 100000 ticks */
    tom_insert(&tau_ittb);

    do {
        getMessage(&m);   /* sets m.date */
        switch (m.type) {
            ⟨Task1 3⟩
            ⟨Task2 4⟩
        }
    } while (FOREVER);
}

This code is used in section 1.


3. Task 1
⟨Task1 3⟩ ≡
case ITTB: D_p[p] = D_p[p] + 1;
    m.type = HEARTBEAT, m.path = p;
    for (q = 0; q < NPROCS; q++)
        if (isneighbor(q, p)) sendMessage(m, q);
    break;

This code is used in section 2.

4. Code of Task 2
⟨Task2 4⟩ ≡
case HEARTBEAT:
    for (q = 0; q < NPROCS; q++)
        if (m.path[q] != ABSENT) D_p[p] = D_p[p] + 1;
    m.path[p] = m.path[p] + 1;
    for (q = 0; q < NPROCS; q++)
        if (isneighbor(q, p) && m.path[q] < PRESENT) sendMessage(m, q);
    break;

This code is used in section 2.

5. Extra functions

int actionItsTimeToBroadcast()   /* sends ITTB to caller */
{
    sendMessage(ITTB, p);
}


Figure 4: Reformulation of the HB failure detector for partitionable networks l|3. 
1. Code of a group membership failure detector.
Raynal and Tronel, Distributed Systems Engineering 6 (1999) 95-102.

#define ITTS 1
#define ITTB 2
#define I_AM_ALIVE 3
⟨Initialisation 2⟩

2. Every process p executes the following:

⟨Initialisation 2⟩ ≡
main()
{
    timeout_t tau_next, tau_bcast;
    date_t nextTimeout_i, timeout_i[NPROCS], nextBroadcast_i;
    boolean_t groupFailure_i;
    message_t m;
    task_t j;

    groupFailure_i = False;
    nextBroadcast_i = getCurrentDate();
    for (q = 0; q < NPROCS; q++) {
        timeout_i[q] = MAX_DATE;
        r_i[q] = 0;
    }
    nextTimeout_i = min(timeout_i, NPROCS);

    tom_set_action(&tau_next, actionItsTimeToStop);       /* sends message ITTS */
    tom_set_deadline(&tau_next, T_r);
    tom_insert(&tau_next);

    tom_set_action(&tau_bcast, actionItsTimeToBroadcast); /* sends message ITTB */
    tom_set_deadline(&tau_bcast, T_e);
    tom_insert(&tau_bcast);

    while (!groupFailure_i) {
        getMessage(&m);   /* sets m.date */
        j = m.sender;
        switch (m.type) {
            ⟨Task1 3⟩
            ⟨Task2 4⟩
            ⟨Task3 5⟩
        }
    }
}

This code is used in section 1.


3. Task 1
⟨Task1 3⟩ ≡
case ITTB: sendMessageAll(I_AM_ALIVE, i);   /* send "i is alive" to all */
    tom_insert(&tau_bcast);   /* a cyclic timeout could have been used here */
    b_i = b_i + 1;
    break;

This code is used in section 2.

4. Code of Task 2
⟨Task2 4⟩ ≡
case ITTS: groupFailure_i = True;
    break;

This code is used in section 2.

5. Code of Task 3
⟨Task3 5⟩ ≡
case I_AM_ALIVE: timeout_i[j] = m.date + T_r;
    nextTimeout_i = min(timeout_i, NPROCS);
    tom_set_deadline(&tau_next, nextTimeout_i);
    tom_insert(&tau_next);
    r_i[j] = r_i[j] + 1;
    break;

This code is used in section 2.

6. Ancillary functions.

int actionItsTimeToStop()        /* sends message ITTS */
{
    sendMessage(ITTS, i);
}

int actionItsTimeToBroadcast()   /* sends message ITTB */
{
    sendMessage(ITTB, i);
}


Figure 5: Reformulation of the group membership failure detector @. 
Figure 6: Algorithm of the DIR-M.

Figure 7: Algorithm DIR-B.

# include files
# defines are importable from include files via #include statements
INCLUDE "my_definitions.h"
INCLUDE "./BACKBONE.H"

# definitions
# definitions start with the 'DEFINE' keyword, followed
# by an integer, an interval, or a list, followed
# by the equal sign and a role, that may be
# ASSISTANT[s] or MANAGER
NPROCS = 4

DEFINE 2-4 = ASSISTANTS
DEFINE 1 = MANAGER

# NPROCS = 2
# DEFINE 2 = ASSISTANT

MIA_SEND_TIMEOUT = 800000       # Manager Is Alive -- manager side
TAIA_RECV_TIMEOUT = 1800000     # This Agent Is Alive timeout -- manager side
MIA_RECV_TIMEOUT = 1500000      # Manager Is Alive -- backup side
TAIA_SEND_TIMEOUT = 1000000     # This Agent Is Alive timeout -- backup side
TEIF_TIMEOUT = 1800000          # after this time a suspected node is assumed
                                # to have crashed.
I'M_ALIVE_CLEAR_TIMEOUT = 900000   # I'm Alive timeout -- clear IA flag
I'M_ALIVE_SET_TIMEOUT = 1400000    # I'm Alive timeout -- set and checks IA flag

REQUEST_DB_TIMEOUT = 2000000
REPLY_DB_TIMEOUT = 4000000

MID_TIMEOUT = 1000000   # if a TEIF is received, up to MID_TIMEOUT ticks
                        # are allowed for reintegrating a new manager,
                        # otherwise, the node of the manager is considered
                        # to be dead.

Figure 8: Excerpt from the configuration script of the DIR net. 
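
As a rough idea of what the translation of one time-out line of such a script
into a C header entry may involve, the sketch below parses a line of the form
NAME = value # comment and formats a corresponding #define. It is an assumption
made for illustration only: the actual translator, its grammar and the layout
of the generated headers are not described here, the function names
parse_timeout_line and emit_define are hypothetical, and only names made of
letters, digits and underscores are handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse "NAME = value  # comment". On success store the name and the value
   and return 1; return 0 for comments, DEFINE lines and malformed input. */
int parse_timeout_line(const char *line, char *name, size_t name_size,
                       long *value)
{
    size_t n = 0;
    char *end;

    assert(line != NULL && name != NULL && value != NULL && name_size > 0);
    while (isspace((unsigned char)*line))
        line++;
    while (isalnum((unsigned char)*line) || *line == '_') {
        if (n + 1 >= name_size)
            return 0;              /* name does not fit in the buffer */
        name[n++] = *line++;
    }
    if (n == 0)
        return 0;                  /* e.g. a comment line starting with '#' */
    name[n] = '\0';
    while (isspace((unsigned char)*line))
        line++;
    if (*line != '=')
        return 0;                  /* e.g. "DEFINE 1 = MANAGER" */
    line++;
    *value = strtol(line, &end, 10);
    return end != line;            /* anything after the number is ignored */
}

/* Format one header entry; return 1 if it fits in the output buffer. */
int emit_define(char *out, size_t out_size, const char *name, long value)
{
    int len = snprintf(out, out_size, "#define %s %ld\n", name, value);
    return len > 0 && (size_t)len < out_size;
}
```

Applied to the first time-out line of Fig. 8, parse_timeout_line yields the
name MIA_SEND_TIMEOUT and the value 800000, from which emit_define produces a
single #define line.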
bash-2.05b$ art -s
Ariel translator, v4.0g 2-Dec-2004, (c) 2004 Universiteit Antwerpen.
Parsing file .ariel...
[ Including file 'my_definitions.h' ...9 associations have been stored. ]
[ Including file './BACKBONE.H' ...55 associations have been stored. ]
if-then-else: ok
...done (148 lines.)
Output written in file .rcode.
Watchdogs configured.
N-version tasks configured.
Logicals written in file LogicalTable.csv.
Tasks written in file TaskTable.csv.
static version
Preloaded r-codes written in file ../trl.h.
Time-outs written in file ../timeouts.h.
Identifiers written in file ../identifiers.h.
Alpha-count parameters written in file ../alphacount.h.
Press ^C to finish processing...


Figure 9: Configuration tool of the DIR net. 
18. This loop is the real core of the manager. It has to deal with a number of
messages coming from the timeout manager, its fellow backups, the recovery
thread, and the remote I’m Alive Tasks. The core of the fault-tolerant strategy
of the DIR net is in here.

⟨manager loop (waiting for incoming messages) 18⟩ ≡
while (1) {
    ⟨wait for an incoming message 53⟩
    tom_dump(tom);
    switch (message.type) {
    case INJECT_FAULT_TIMEOUT:
        LogError(EC_ERROR, "Manager loop", "Fault injection");
        tom_close(tom);   /* the time-out manager is detached */
        break;

    case IA_FLAG_TIMEOUT:
        LogError(EC_ERROR, "Manager loop",
                 "IA_FLAG_TIMEOUT message -> Clear IA-flag.");
        /* time to clear the IA-flag! */

(clear lA-flag le) 

        break;

    case MIA_TIMEOUT:
        LogError(EC_ERROR, "Manager loop",
                 "MIA_TIMEOUT message (time to send a MIA to Backup %d)",
                 message.subid);
        /* time to send a MIA to a backup */
        ⟨send MIA to backup subid 19⟩
        tom_dump(tom + message.subid);
        tom_renew(tom, mia + message.subid);
        break;


Figure 10: Excerpt from the CWEB source of the DIR net. 